feat(e2e-harness): drive and snapshot the real wizard TUI by gewenyu99 · Pull Request #702 · PostHog/wizard

gewenyu99 · 2026-06-21T15:09:00Z

How to test

Agent route — drive the wizard yourself. In a fresh session in this repo, run the exploring-the-wizard skill. wizard-ci is registered in .mcp.json, so the tools are already bound: open_app boots the real TUI on an app, then read_state / perform_action / render_screen (which returns the real rendered screen).

CI snapshots — real-TUI visual regression. From a wizard-workbench checkout next to this repo (PostHog creds in its .env):

cd ../wizard-workbench && pnpm wizard-ci-snapshots

Runs the full real agent flow against express-todo through the real TUI, captures each key moment, diffs the committed baseline, and writes report.html. Or comment /wizard-ci on a PR — same run, posted back as a comment. (Pairs with PostHog/wizard-workbench#2012.)

What this is

A headless e2e control plane that drives the real wizard TUI and captures what it renders. Both routes share one primitive:

Host (scripts/tui-host.no-jest.ts) runs the real startTUI and drives its store by state manipulation — no keystrokes. Auth uses the phx key (same bearer as an OAuth token), so the TUI advances with no browser.
Capture (e2e-harness/tui-capture.ts) runs the host in a PTY (node-pty) and reads the real rendered screen via @xterm/headless.

Routes:

CI snapshots (tui-snapshots): the fixed e2e profile self-drives the host through the real agent run → one real-TUI text snapshot per key moment (including the run screen's progression), diffed against a committed baseline.
Agent (wizard-ci-mcp): an MCP server proxies the host so an agent decides each screen; render_screen returns the real frame. The exploring-the-wizard skill is the how-to.

None of it ships — it lives in e2e-harness/ + scripts/, out of src/.

github-actions · 2026-06-21T15:09:11Z

🧙 Wizard CI

Run the Wizard CI and test your changes against wizard-workbench example apps by replying with a GitHub comment using one of the following commands:

Test all apps:

/wizard-ci all

Test all apps in a directory:

/wizard-ci basic-integration
/wizard-ci error-tracking-upload-source-maps
/wizard-ci misc
/wizard-ci revenue

Test an individual app:

/wizard-ci basic-integration/android
/wizard-ci basic-integration/angular
/wizard-ci basic-integration/astro

/wizard-ci basic-integration/django
/wizard-ci basic-integration/fastapi
/wizard-ci basic-integration/flask
/wizard-ci basic-integration/javascript-node
/wizard-ci basic-integration/javascript-web
/wizard-ci basic-integration/laravel
/wizard-ci basic-integration/next-js
/wizard-ci basic-integration/nuxt
/wizard-ci basic-integration/python
/wizard-ci basic-integration/rails
/wizard-ci basic-integration/react-native
/wizard-ci basic-integration/react-router
/wizard-ci basic-integration/sveltekit
/wizard-ci basic-integration/swift
/wizard-ci basic-integration/tanstack-router
/wizard-ci basic-integration/tanstack-start
/wizard-ci basic-integration/vue
/wizard-ci error-tracking-upload-source-maps/android
/wizard-ci error-tracking-upload-source-maps/cicd-docker-node-raw
/wizard-ci error-tracking-upload-source-maps/cicd-github-actions-docker-node-raw
/wizard-ci error-tracking-upload-source-maps/cicd-github-actions-nested-docker-node-raw
/wizard-ci error-tracking-upload-source-maps/cicd-github-actions-node-raw
/wizard-ci error-tracking-upload-source-maps/cicd-gitlab-node-raw
/wizard-ci error-tracking-upload-source-maps/cicd-ssh-vps-node-raw
/wizard-ci error-tracking-upload-source-maps/flutter
/wizard-ci error-tracking-upload-source-maps/ios
/wizard-ci error-tracking-upload-source-maps/next
/wizard-ci error-tracking-upload-source-maps/next-no-posthog
/wizard-ci error-tracking-upload-source-maps/node-raw
/wizard-ci error-tracking-upload-source-maps/node-rollup
/wizard-ci error-tracking-upload-source-maps/node-rollup-typescript-plugin
/wizard-ci error-tracking-upload-source-maps/node-webpack
/wizard-ci error-tracking-upload-source-maps/nuxt-3-6
/wizard-ci error-tracking-upload-source-maps/nuxt-4-3
/wizard-ci error-tracking-upload-source-maps/react-native
/wizard-ci error-tracking-upload-source-maps/react-vite
/wizard-ci error-tracking-upload-source-maps/rust
/wizard-ci misc/quack-quack
/wizard-ci revenue/stripe

Results will be posted here when complete.

…ord/replay

A control plane over the TUI store that drives the wizard end-to-end with no terminal and no browser, for CI/e2e and agent-driven testing. The render is a pure function of the nanostore, so driving committed state == driving the UI.

Core files (src/lib/ci-driver/):
- wizard-ci-driver.ts: read_state / list_actions / perform_action over a live WizardStore. read_state is a truthful, secret-free projection of committed state (+ derived currentScreen); perform_action commits via the exact store setter the Ink screen's key handler calls.
- action-registry.ts: declarative screen -> commit-action map (exhaustive over ScreenId/Overlay). The actuation surface: name an action, not a keystroke.
- wizard-ci-tools.ts: in-process MCP server exposing the three tools, so an external harness or LLM can drive a real run.
- e2e-profile.ts: WizardE2eProfile, a program's declarative e2e test definition (the UI choices). decideE2eAction(state, profile) maps screen -> commit, so the harness is generic and the choices live on the program.
- recorder.ts: captures a frame at each key moment (route/task/status/runPhase/overlay change) off the store's version counter; redacts the access token.
- replay.ts: reconstructs a throwaway store per frame and renders the REAL Ink screen back to ANSI, so a run replays in the terminal.
- DRIVING-E2E-FROM-AN-AGENT.md: how a future agent drives these.
- __tests__/: control-plane walk, flow snapshot (TUI-snapshot analog), recorder.

Programs declare their flow's UI choices:
- programs/program-step.ts: ProgramConfig.e2e?: WizardE2eProfile.
- programs/posthog-integration/index.ts: the integration program's e2e profile.

Harness/entry scripts:
- scripts/e2e-full-run.no-jest.ts: headless full run: real WizardStore + InkUI (never rendered) + concurrent driver + real runAgent; emits a structured result + a recording.
- scripts/replay-e2e.no-jest.ts: replay a recording in the terminal.
- scripts/ci-driver-demo.ts: offline control-plane demo (no agent).

Additive; no core wizard behavior changed. The workbench `wizard-ci --e2e` (PostHog/wizard-workbench) orchestrates these against real test apps.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
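
To make the "name an action, not a keystroke" idea concrete, here is a minimal sketch of a declarative screen -> action map and a profile-driven decision function. It is not the code in this PR: the screen ids, the `choose_option` and `skip` action names, and the `setupOptionIndex` field are all hypothetical. Only `confirm_setup` and the `decideE2eAction(state, profile)` shape come from the commit messages.

```typescript
// Hypothetical screen ids; the real ScreenId union lives in the wizard source.
type ScreenId = "intro" | "setup" | "mcp" | "auth" | "run";

// Hypothetical action names, apart from confirm_setup, which the PR mentions.
type ActionName = "confirm_setup" | "choose_option" | "skip";

// Assumed profile shape: the UI choices a headless run should take.
interface E2eProfile {
  setupOptionIndex: number;
}

interface Action {
  name: ActionName;
  args?: Record<string, unknown>;
}

// Record<ScreenId, ...> makes the map exhaustive at compile time: adding a
// screen to the union without adding an entry here is a type error.
const ACTION_REGISTRY: Record<ScreenId, readonly ActionName[]> = {
  intro: ["confirm_setup"],
  setup: ["choose_option"],
  mcp: ["skip"],
  auth: [], // no commit action in this sketch
  run: [],
};

// Map the current screen to the action the profile wants, or null when the
// screen has nothing to commit.
function decideE2eAction(
  state: { currentScreen: ScreenId },
  profile: E2eProfile,
): Action | null {
  const [name] = ACTION_REGISTRY[state.currentScreen];
  if (name === undefined) return null;
  return name === "choose_option"
    ? { name, args: { index: profile.setupOptionIndex } }
    : { name };
}
```

Because the decision depends only on the projected state and the profile, the same driver loop can serve any program that supplies a profile.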

The e2e UI-choices object moves out of index.ts into a co-located e2e.ts (POSTHOG_INTEGRATION_E2E_PROFILE), keeping the program config lean and the flow's test definition in its own file. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

scripts/record-demo.no-jest.ts — produces a recording offline (no agent, no network) by driving the integration flow with the e2e profile + a WizardRecorder, so `replay-e2e.no-jest.ts` can be tried without a full run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

scripts/README.md documents the manual control-plane + record/replay tools (what each does, what it needs, how to run). Also commits ci-driver-live-agent.ts (real gateway LLM drives the wizard-ci-tools MCP server) so the index is complete. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

main added two confirm-and-continue intro screens (WarehouseIntro, SelfDrivingIntro, both call store.completeSetup()). The action-registry exhaustiveness test flagged them as uncovered. Register both as confirm_setup in ACTION_REGISTRY and in the e2e walk policy. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
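
The exhaustiveness check described above can be sketched as a pure function that lists every screen that is neither actionable nor explicitly marked no-action. This is an illustration under assumptions, not the repo's test: the screen id strings and the `uncoveredScreens` helper are hypothetical, while `ACTION_REGISTRY`, `NO_ACTION_SCREENS`, and `confirm_setup` are names taken from the commit messages.

```typescript
// Hypothetical id strings for the screens; the real list comes from the wizard.
const ALL_SCREENS = ["intro", "warehouse-intro", "self-driving-intro", "auth", "run"] as const;
type ScreenId = (typeof ALL_SCREENS)[number];
type Registry = Partial<Record<ScreenId, readonly string[]>>;

// A screen is covered when it has registered actions or is explicitly no-action.
function uncoveredScreens(registry: Registry, noAction: ReadonlySet<ScreenId>): ScreenId[] {
  return ALL_SCREENS.filter((s) => registry[s] === undefined && !noAction.has(s));
}

const NO_ACTION_SCREENS: ReadonlySet<ScreenId> = new Set<ScreenId>(["auth", "run"]);

// Before the fix: the two new intro screens are missing from the registry.
const registryBefore: Registry = { intro: ["confirm_setup"] };

// After the fix: both are registered as confirm_setup.
const registryAfter: Registry = {
  ...registryBefore,
  "warehouse-intro": ["confirm_setup"],
  "self-driving-intro": ["confirm_setup"],
};
```

A test can then assert that `uncoveredScreens(...)` is empty, so a screen added on main fails the suite until someone decides how the harness should treat it.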

…l refs Move DRIVING-E2E-FROM-AN-AGENT.md → ARCHITECTURE.md to match the co-located subsystem-doc convention (cf. programs/self-driving/ARCHITECTURE.md). Remove content that shouldn't ship in the public repo: the internal test project id + team name, the workbench test-api-key.txt secret file, and pointers to workbench-only scratch files. Keep the architecture, profiles, record/replay, and MCP-loop guidance; generalize the run instructions. Update the scripts/README link. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

scripts/render-snapshots.no-jest.ts renders every key-moment frame of a recording to a real-Ink ANSI snapshot (one <seq>-<screen>.ans per frame), via replay's renderFrame under tsx. These feed the workbench visual-regression flow. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

None of the control-plane / recording / e2e machinery belongs in the wizard's production source. Relocate src/lib/ci-driver/ → e2e-harness/ at the repo root (next to e2e-tests/), and sever every prod coupling:
- Remove the ProgramConfig.e2e field (program-step.ts) and the on-program profile (delete posthog-integration/e2e.ts, unwire index.ts). Per-program profiles now live in the harness: e2e-harness/profiles.ts, profileFor(programId).
- Add an @e2e-harness/* path alias (tsconfig.build.json + jest moduleNameMapper); repoint scripts/tests off @lib/ci-driver.

Result: src/ has ZERO references to the harness, and the published tsdown bundle contains none of it (previously the ~90-byte profile object shipped). Full suite (1045 tests, 3 snapshots) passes; real-recording render verified under tsx.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ARCHITECTURE.md now documents the wizard-ci-snapshots visual-regression flow (real run → render → diff → side-by-side report) and the env it needs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…gram A test/ README documents this program's e2e test definition — the path the headless run walks and the option it auto-takes at each screen (confirm intro, dismiss outage, first setup option, skip mcp/slack, delete skills). It's the human description; the runnable profile stays in e2e-harness/profiles.ts. No e2e machinery returns to prod src — this is documentation only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…oads Each program declares its e2e test path as src/lib/programs/<program>/test/e2e.json — a `profile` (the options the headless run auto-takes) plus a documented `path` of every screen. The harness imports the `profile` in e2e-harness/profiles.ts (single source of truth, no prose duplication). Matches the repo's existing JSON-data pattern (mcp-role-prompts.copy.json); resolveJsonModule already on. It's data, imported only by the harness — zero prod imports, absent from the tsdown bundle. Full harness suite + runtime load verified. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add the end-to-end trace (agent → perform_action → driver → action-registry → store.completeSetup → emitChange → router re-resolve → readState) as a comment at the perform_action tool, with cross-referenced breadcrumbs at the driver hop (one committed mutation per call) and the action-registry hop (the store setter + flag-flip the screen sequence reacts to). Harness-only; prod store.ts untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…dule Add a header note to wizard-ci-tools / wizard-ci-driver / action-registry / recorder / replay: each lives in e2e-harness/, is imported only by scripts/tests, and is absent from the tsdown bundle (bin.ts is the only entry). Addresses the "this looks shippable" worry right where a reader meets the code (esp. the MCP server + SDK import). Verified: no e2e symbols in dist/. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Moving the trace / never-ships / credentials notes to PR review comments anchored to the lines instead — keep the source uncluttered. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…-by-turn scripts/wizard-ci-mcp.no-jest.ts is a stdio MCP server over one live WizardStore: read_state / list_actions / perform_action / render_screen / run_agent. An agent registers it and makes every decision live, instead of the static scripted run. Rewrite the exploring-the-wizard skill to lead with this. Bump zod ^3.24→^3.25 (the MCP SDK needs the zod/v3 subpath; non-breaking) and add the SDK as a dep. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Same resolved version; just the package.json floor, so #701 and #702 don't conflict on the zod line. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

read_state already returns the legal actions, so the separate tool is noise. Keeps the server's surface minimal: read_state, perform_action, render_screen, run_agent. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…hange Running prettier on these (not in lint-staged) reflowed the whole files — pure diff noise. Restore them to main and re-apply just the intended edits: the "Explore with an agent" section + the exploring-the-wizard skill row.

…d runbook EXPLORING-AS-AN-AGENT.md was promoted to .claude/skills/exploring-the-wizard/; this pointer fix was left uncommitted, so HEAD still linked the deleted file.

…ion start The skill told agents to `claude mcp add` then immediately call the tools, which is impossible (MCP servers load at session start), so agents fell back to a script. Lead with the in-session way that actually works — a WizardCiDriver script (read_state → perform_action → renderFrame), tested — and document the MCP server as the interactive option that needs registering before a fresh session.

…with it Connect the stdio transport first and build the store lazily on the first tool call — detection + the networked health probe used to run before connect(), which could stall the MCP handshake so Claude Code saw the server as broken. Verified end-to-end: `claude mcp add` → `claude mcp list` shows ✔ Connected → a headless session drove read_state → perform_action(confirm_setup) → auth → render_screen. Skill now leads with the two-phase MCP flow (register, then drive in a fresh session, since MCP tools bind at session start); the driver script is the fallback.
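
The "connect first, build the store lazily" ordering can be illustrated with a memoized async factory. This is a generic sketch, not the server code: `lazy`, `Store`, `getStore`, and the `builds` counter are hypothetical names, and the factory body merely stands in for detection and the networked health probe.

```typescript
// Memoize an async factory so the expensive work runs once, on first use.
function lazy<T>(factory: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => (cached ??= factory());
}

// Hypothetical stand-in for the wizard store.
interface Store {
  screen: string;
}

let builds = 0;
const getStore = lazy<Store>(async () => {
  builds += 1; // stands in for detection + the health probe
  return { screen: "intro" };
});

// A tool handler awaits the store only when it is actually called, so the
// transport handshake can complete before any slow work starts.
async function readState(): Promise<{ currentScreen: string }> {
  const store = await getStore();
  return { currentScreen: store.screen };
}
```

Caching the promise rather than the resolved value means two tool calls that arrive together share a single build instead of racing to start two.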

…drives in one session Register wizard-ci in .mcp.json so its tools are bound in every session in this repo. An agent following the exploring-the-wizard skill now drives the wizard over MCP (open_app -> read_state -> perform_action -> render_screen -> run_agent) without registering anything or starting a fresh session. The server boots app-agnostic; open_app picks the app + key at call time, so the committed config holds no secrets. Skill + README rewritten to the one-session MCP flow. Verified: a fresh headless agent given only the skill drove the wizard with four MCP calls and wrote zero scripts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Just say to point appDir at the directory that has the package.json. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

appDir is just the throwaway copy of the app; let the agent find the path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

auth (and run) are NO_ACTION screens: session.credentials is set only inside bootstrapProgram, which runs via run_agent. So nothing advances past auth without run_agent — but the tool description said "call when currentScreen=run" and the skill walk skipped auth, so an agent landed on auth and polled instead of calling run_agent. Fix the run_agent description and the skill walk/key-facts to say run_agent bootstraps creds and advances auth+run; don't poll those screens. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ves the run A real run_agent call blocked the stdio MCP server for ~3 minutes; the client treated the server as unhealthy, reconnected, and the restarted process lost its in-memory store ("No app open", runPhase reset to idle). run_agent now starts the integration in the background and returns immediately; read_state stays responsive and reports runPhase running -> completed plus an integration status, so the agent polls instead of blocking. Skill + tool descriptions updated to the poll model; noted that run_agent creates real PostHog resources each run. Proven: run_agent returns in 0.0s; read_state during the run answers in 1-2ms with runPhase=running. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
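
The poll model described above can be sketched as follows: start the long-running work without awaiting it, and let a cheap state read report progress. This is an assumption-laden illustration, not the server's implementation. `RunController` and `pollUntilDone` are hypothetical, and the `failed` phase and `integrationError` field are borrowed from a later commit in this PR rather than from this one.

```typescript
type RunPhase = "idle" | "running" | "completed" | "failed";

class RunController {
  private phase: RunPhase = "idle";
  private error: string | undefined;

  // Kick off the work and return at once; the promise is deliberately not
  // awaited, so the caller (a tool handler) is never blocked by the run.
  runAgent(work: () => Promise<void>): { started: boolean } {
    if (this.phase === "running") return { started: false };
    this.phase = "running";
    this.error = undefined;
    Promise.resolve()
      .then(work)
      .then(
        () => {
          this.phase = "completed";
        },
        (err: unknown) => {
          this.phase = "failed";
          this.error = err instanceof Error ? err.message : String(err);
        },
      );
    return { started: true };
  }

  // Cheap, synchronous projection that stays responsive during the run.
  readState(): { runPhase: RunPhase; integrationError?: string } {
    return this.error === undefined
      ? { runPhase: this.phase }
      : { runPhase: this.phase, integrationError: this.error };
  }
}

// Client side: poll until the run reaches a terminal phase.
async function pollUntilDone(ctl: RunController, intervalMs = 10) {
  for (;;) {
    const state = ctl.readState();
    if (state.runPhase === "completed" || state.runPhase === "failed") return state;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Since no request stays open for the length of the run, the client has no reason to treat the server as unhealthy and restart it, which is what lost the in-memory store.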

…or both routes

Both e2e routes run the real wizard TUI (startTUI) driven by store state manipulation, with no keystrokes, and capture the real rendered screen from a PTY. Auth is satisfied by setCredentials with the phx key (same bearer as an OAuth token), so the TUI advances with no browser.
- e2e-harness/tui-capture.ts: run a command in a PTY (node-pty), read its screen via @xterm/headless.
- scripts/tui-host.no-jest.ts: the real-TUI host. MODE=fixed self-drives the fixed e2e profile, signals each screen, writes a structured result JSON; MODE=serve takes drive commands over a unix socket.
- scripts/tui-snapshots.no-jest.ts: CI route: real-TUI text snapshot per screen.
- scripts/wizard-ci-mcp.no-jest.ts: agent route: MCP server proxying the host.
- scripts/wizard-ci-explore.no-jest.ts: drive the MCP route, print the real TUI.
- scripts/tui-replay.no-jest.ts: replay captured snapshots in the terminal.

Deletes the record-then-reconstruct machinery (recorder, replay, e2e-full-run, render-snapshots, replay-e2e) and the in-process wizard-ci-tools server. Adds node-pty + @xterm/headless.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…sition Snapshot on key moments — a screen change, a task-list update, or a runPhase change — via a store subscription, and snap each screen before the driver acts on it. The run screen (the agent working) is captured as it progresses, and fast transitions (intro/auth/outro/mcp/slack) are no longer skipped by throttling. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
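
One way to picture "snapshot on key moments" is a subscription listener that fires only when the screen, the task list, or the run phase changes. The sketch below is a guess at the shape, not the host's code: the `WizardState` projection, `momentKey`, and `onKeyMoment` are hypothetical names.

```typescript
// Hypothetical projection of the store fields that define a key moment.
interface WizardState {
  currentScreen: string;
  tasks: readonly string[];
  runPhase: string;
}

// Collapse those fields into one comparable string.
function momentKey(s: WizardState): string {
  return JSON.stringify([s.currentScreen, s.tasks, s.runPhase]);
}

// Build a listener suitable for a store subscription. It calls snap() only
// when the key differs from the previous emit, so repeated identical emits
// produce a single snapshot and no change is dropped by time-based throttling.
function onKeyMoment(snap: (s: WizardState) => void): (s: WizardState) => void {
  let last: string | undefined;
  return (s) => {
    const key = momentKey(s);
    if (key === last) return;
    last = key;
    snap(s);
  };
}
```

Comparing a derived key instead of using a timer keeps fast transitions, because what matters is whether the state changed, not how quickly.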

…ed loop Snapshot on every key-moment change (no throttle spacing, just a settle). And don't await the driver loop at exit — on the cheap (no-agent) path it's parked in waitForChange, so awaiting it hung the process and exited non-zero, which would fail CI. The process now exits 0 cleanly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The fixed CI route always drives the full real agent run — a no-agent path was pointless (and is what hung at exit). Removes the RUN_AGENT branch and the auth-by-state shortcut it needed in fixed mode; auth is bootstrapped by the run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

node-pty ships no linux-x64 prebuilt, so CI must compile it; pnpm 10 blocks build scripts unless allowlisted. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ink renders non-interactively when it detects CI (CI / GITHUB_ACTIONS), leaving the captured xterm buffer blank. Strip them from the spawned host's env. Verified locally: with CI=true, render_screen now returns the real TUI instead of blank. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
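
Stripping the CI markers before spawning the host could look like the sketch below. `hostEnv` and `CI_MARKERS` are hypothetical names; only the two variable names, CI and GITHUB_ACTIONS, come from the commit message.

```typescript
// Variables that make ink treat the session as non-interactive.
const CI_MARKERS: readonly string[] = ["CI", "GITHUB_ACTIONS"];

// Return a copy of the environment without the CI markers (and without
// undefined entries), leaving the caller's environment untouched.
function hostEnv(env: Record<string, string | undefined>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(env)) {
    if (value === undefined) continue;
    if (CI_MARKERS.includes(key)) continue;
    out[key] = value;
  }
  return out;
}
```

In a Node script, the result of `hostEnv(process.env)` would be passed as the `env` option when spawning the host process. Building a filtered copy, rather than deleting keys from `process.env`, keeps the parent CI job's own environment intact.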

main added the source-maps detection screen; the action-registry exhaustiveness test requires every screen be actionable or explicitly no-action. The integration e2e profile never enters the source-maps program, so it joins the other non-integration screens in NO_ACTION_SCREENS, with a note to wire it in when a source-maps profile drives that program. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

postbuild copies scripts/ into dist (which ships); drop the *.no-jest.* e2e/CI scripts from dist so the published wizard carries only runtime scripts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
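
As a rough illustration of the filter, the predicate below treats any file name matching the `*.no-jest.*` pattern as an e2e/CI script to leave out of dist. The helper names and the example file names in the usage note are hypothetical; how the real postbuild step removes the files is not shown in the commit message.

```typescript
// True for names such as "tui-host.no-jest.ts" or "tui-host.no-jest.js".
function isNoJestScript(file: string): boolean {
  return /\.no-jest\.[^/]+$/.test(file);
}

// Keep only the scripts that should ship in the published package.
function runtimeScripts(files: readonly string[]): string[] {
  return files.filter((file) => !isNoJestScript(file));
}
```

For example, given `["tui-host.no-jest.js", "postinstall.js"]` (made-up names), `runtimeScripts` keeps only `postinstall.js`.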

- drop a stray blank line from posthog-integration config (no prod diff)
- extract the shared intro/health-check/run sequence in tui-host
- pass projectId to getOrAskForProjectData as a number (its declared type)
- strip host AI_AGENT alongside CLAUDE/ANTHROPIC, matching the workbench

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

edwinyjlim

good since it's all additive

- never write an inline api key to disk; pass it to the host via env (POSTHOG_PERSONAL_API_KEY), same as the CI path. A caller-supplied keyFile is still used as-is.
- surface a failed run's error in read_state (integrationError) so CI and the agent see why the integration failed instead of a bare 'failed'.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
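
The key-handling rule in the first bullet can be sketched as a small pure function that decides how the host receives the credential. Everything here except the POSTHOG_PERSONAL_API_KEY variable name and the keyFile concept is hypothetical, including the argument shape, the `keyLaunchConfig` helper, and the choice to prefer keyFile when both are supplied.

```typescript
// Hypothetical arguments a caller might pass when opening an app.
interface OpenAppArgs {
  apiKey?: string;
  keyFile?: string;
}

// What the host process is launched with.
interface HostLaunch {
  env: Record<string, string>;
  keyFile?: string;
}

// An inline key travels only through the child's environment and is never
// written to disk; a caller-supplied key file is passed through unchanged.
// Preferring keyFile when both are given is an assumption of this sketch.
function keyLaunchConfig(args: OpenAppArgs, baseEnv: Record<string, string>): HostLaunch {
  if (args.keyFile !== undefined) {
    return { env: { ...baseEnv }, keyFile: args.keyFile };
  }
  if (args.apiKey !== undefined) {
    return { env: { ...baseEnv, POSTHOG_PERSONAL_API_KEY: args.apiKey } };
  }
  throw new Error("either apiKey or keyFile is required");
}
```

Keeping the decision in one function makes it easy to test that no code path produces a temp file for an inline key.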

Spell out the explore walk (open_app, snapshot each key moment, act, run_agent, finish) and have it save numbered render_screen frames to /tmp/wz-explore-snaps, matching the CI route's .txt frames. Align the skill's snapshot guidance with the README example. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Resolved 5 conflicts from main's #702/#725/#726:
- runner/index.ts: combined our idempotent flushScanReport finalizer (registerCleanup + finally, return await) with main's stampVariant() calls in both fork arms
- constants.ts: kept WIZARD_WARLOCK_DISABLED_FLAG_KEY; took main's removal of WIZARD_VARIANTS (variant is now runner-derived via stampVariant)
- package.json: kept both new deps (@vitest/coverage-v8 + @xterm/headless); dropped main's re-added root jest config block (root is vitest now; e2e-tests keeps its own jest config)
- tsconfig.json: added main's e2e-harness to include; kept our e2e-tests exclusion (standalone jest package, not in the vitest root typecheck)
- pnpm-lock.yaml: regenerated via pnpm install

Canonicalized main's new e2e-harness snapshots to vitest key format (content unchanged; jest used "describe test", vitest uses "describe > test"). Full suite green: 987 tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gewenyu99 mentioned this pull request Jun 21, 2026

fix(security): stop ANTHROPIC_BASE_URL settings overrides redirecting the agent off the PostHog gateway #703

Draft

gewenyu99 marked this pull request as ready for review June 22, 2026 20:57